In this notebook, we visualize the distribution of classes in the Kaggle Right Whale Recognition Challenge data set. In a later notebook, we will learn how to augment the data set by applying affine transforms to the images, but for now we will stick to features derived from the data set provided by Kaggle.
In [1]:
%matplotlib inline
# The magic command above lets us display the seaborn plots inline in the IPython notebook
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
In [2]:
data = pd.read_csv("/Users/.../Machine Learning Competitions/Kaggle/Right Whale Recognition Challenge/features/rgbHistogramsTrainSet8Bins.csv", sep = ",");
As you can see, the images in the Kaggle data set are far from being evenly distributed. Many classes have fewer than ten observations while, on the other extreme, a couple of classes have more than forty observations.
In [4]:
plot = sns.countplot(x="WhaleID", data=data, palette = "Blues_d");
# Hide the x-axis tick labels; there are too many classes for them to be readable
plot.xaxis.set_major_formatter(plt.NullFormatter())
Now let's see how many observations we have in total.
In [5]:
num_obs = data.shape[0]
print(num_obs)
That's quite a bit of data to work with. Now, let's do a bit more analysis on the distribution using the pandas value_counts method.
In [6]:
# Make a new histogram of classes
histogram = data["WhaleID"].value_counts()
In [7]:
# Turn it into a dictionary for later use
# The dictionary is in the form {"Whale_ID" : num_observations}
histogram_dict = histogram.to_dict()
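To make the shape of these two objects concrete, here is a minimal sketch on toy data. The frame `toy` and its whale IDs are made up for illustration and are not taken from the Kaggle data set.

```python
import pandas as pd

# Toy stand-in for the "WhaleID" column (hypothetical IDs, not real data)
toy = pd.DataFrame(
    {"WhaleID": ["whale_a", "whale_b", "whale_a", "whale_c", "whale_a", "whale_b"]}
)

# value_counts returns a Series indexed by class label,
# sorted from the most frequent class to the least frequent one
toy_histogram = toy["WhaleID"].value_counts()

# to_dict turns that Series into a plain {"label": count} dictionary
toy_histogram_dict = toy_histogram.to_dict()
print(toy_histogram_dict)
```

The Series is handy for vectorized comparisons, while the dictionary is handy for looking up the count of a single class by name.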
How about we plot all the classes that have more than 20 examples?
In [8]:
# Build a boolean mask with one entry per row of the data frame.
# Series.map looks up each row's "WhaleID" in the histogram_dict we created
# earlier, which gives the number of observations in that row's class.
# Comparing the result against 20 yields True for rows whose class has more
# than 20 observations and False otherwise.
# Link: http://pandas.pydata.org/pandas-docs/stable/indexing.html#boolean-indexing
indices = data["WhaleID"].map(histogram_dict) > 20
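An alternative way to build this kind of row filter is to first pick the class labels that pass the threshold and then test membership with `isin`. The sketch below uses a toy frame and a hypothetical threshold of 1 (named `toy_threshold`) so that the result is easy to check by eye; it is one possible formulation, not the only one.

```python
import pandas as pd

# Toy data (hypothetical IDs and feature values, for illustration only)
toy = pd.DataFrame({
    "WhaleID": ["whale_a", "whale_b", "whale_a", "whale_c", "whale_a", "whale_b"],
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
toy_threshold = 1  # stands in for the threshold of 20 used in this notebook

# Count observations per class, then keep the labels above the threshold
toy_counts = toy["WhaleID"].value_counts()
frequent_classes = toy_counts[toy_counts > toy_threshold].index

# Boolean mask: True for every row whose class is one of the frequent ones
toy_mask = toy["WhaleID"].isin(frequent_classes)
toy_frequent = toy[toy_mask]
print(toy_frequent)
```

Here only the single `whale_c` row is dropped, because it is the only class with one observation.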
In [9]:
plot = sns.countplot(x = "WhaleID", data = data[indices], palette = "Greens_d")
# The below code fails b/c it requires the labels
# Link: http://stackoverflow.com/questions/26540035/rotate-label-text-in-seaborn-factorplot
# plot.set_xticklabels(rotation=90)
# So use this label adjustment code instead
# Link: http://stackoverflow.com/questions/31859285/rotate-tick-labels-for-seaborn-barplot
for item in plot.get_xticklabels():
    item.set_rotation(80)
If we're too lazy to count how many of these classes there are, we could just do it this way:
In [16]:
# Link: http://stackoverflow.com/questions/12765833/counting-the-number-of-true-booleans-in-a-python-list
print(sum(histogram > 20))
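This works because each `True` in a boolean Series counts as 1 when summed. A small sketch with made-up counts (the Series `toy_counts` is hypothetical) shows the idea, including the boundary case of a class with exactly 20 observations:

```python
import pandas as pd

# Hypothetical per-class observation counts
toy_counts = pd.Series({"whale_a": 25, "whale_b": 21, "whale_c": 20, "whale_d": 3})

# Summing a boolean Series counts how many entries are True
num_large = int((toy_counts > 20).sum())   # classes with more than 20 observations
num_small = int((toy_counts <= 20).sum())  # classes with 20 or fewer observations
print(num_large, num_small)
```

Note that the class with exactly 20 observations lands in the second group, so the two groups together cover every class exactly once.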
Let's plot all the classes with less than or equal to 20 observations.
In [11]:
indices = data["WhaleID"].map(histogram_dict) <= 20
plot = sns.countplot(x = "WhaleID", data = data[indices], palette = "Purples_d")
plot.xaxis.set_major_formatter(plt.NullFormatter())
Next, what if we wanted to select the rows of the data frame that correspond to an arbitrary set of classes, say, whale_38681 and whale_95370? (By the way, bonus points if you can figure out why I selected these two.) The way we do this is simple.
In [17]:
# Link: http://stackoverflow.com/questions/7571635/fastest-way-to-check-if-a-value-exist-in-a-list
two_whales_data = data[data['WhaleID'].isin(['whale_38681', 'whale_95370'])]
Now we can print out our data.
In [18]:
print(two_whales_data)
What if we wanted to use our previously created histogram variable to return a list of data frames grouped by class? This is easy, too.
In [60]:
def return_data_frames_by_class(data_frame, class_list, y_column_name):
    # Return one data frame per class, in the same order as class_list
    df_list = []
    for class_name in class_list:
        bools = [is_row_part_of_class(index, data_frame, class_name, y_column_name)
                 for index in data_frame.index]
        df_list.append(data_frame[bools])
    return df_list

def is_row_part_of_class(data_frame_index, data_frame, class_name, y_column_name):
    # True if the row with this index label belongs to class_name
    return data_frame.loc[data_frame_index, y_column_name] == class_name
In [61]:
class_list = histogram.axes[0].tolist()
df_list = return_data_frames_by_class(data, class_list, 'WhaleID')
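The same kind of per-class split can also be sketched with pandas' built-in `groupby`, which avoids scanning the whole data frame once per class. The example below uses toy data with made-up IDs and orders the resulting list from the most frequent class to the least frequent one; treat it as one possible alternative rather than the method used in this notebook.

```python
import pandas as pd

# Toy data (hypothetical IDs and feature values, for illustration only)
toy = pd.DataFrame({
    "WhaleID": ["whale_a", "whale_b", "whale_a", "whale_c", "whale_a", "whale_b"],
    "feature": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})

# Class labels ordered from most to least frequent
toy_class_list = toy["WhaleID"].value_counts().index.tolist()

# Group the rows by class once, then pull out one data frame per label
grouped = toy.groupby("WhaleID")
toy_df_list = [grouped.get_group(name) for name in toy_class_list]

print(toy_df_list[0])
```

Each element of `toy_df_list` holds the rows of exactly one class, and together they cover every row of the toy frame.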
Let's see what the first element of our df_list looks like.
In [63]:
print(df_list[0])
It prints out just what we expected. Nice!
Well, that's it for this notebook. In our next notebook, we will create a very non-statistical sampling method as a first attempt at increasing the size of our dataset and improving the distribution of observations between classes. Thanks for reading!